Augmenting Focused Crawling Using Search Engine Queries

Authors

  • XUAN WANG
  • MIN-YEN KAN
Abstract

The pervasiveness of the Internet makes it an ideal medium for sharing scholarly information. Many authors now post their publications online so that others can easily access them, increasing the authors' impact in their research areas. In this project, we develop a focused crawler to find publication pages: web pages that link to online, freely available scholarly publications. In contrast to previous work, which only traverses hyperlinks within web pages, our algorithm leverages search engine queries to locate suitable pages for crawling. This strategy allows our crawler to locate more relevant pages and lessens its reliance on the quality of the seed pages used to start the crawling process. Our crawler can also locate relevant pages that are not accessible to standard crawlers, which work only by following hyperlinks. Our results show that our system avoids a slow start and finds publication pages faster, outperforming local crawling methods.
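The general idea of feeding a crawl frontier from both hyperlinks and search engine results can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function `augmented_crawl`, the callables `fetch_links`, `is_relevant`, `make_query` and `search`, the two-level priority scheme, and the toy data are all hypothetical, and no real search engine API is called.

```python
import heapq

def augmented_crawl(seeds, fetch_links, is_relevant, make_query, search, budget=10):
    """Best-first crawl whose frontier is fed by hyperlinks and by search results.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    frontier = [(0, url) for url in seeds]  # (priority, url); lower is visited first
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    visited = 0
    while frontier and visited < budget:
        _, url = heapq.heappop(frontier)
        visited += 1
        hit = is_relevant(url)
        if hit:
            relevant.append(url)
        # Local step: follow the hyperlinks found on the page.
        candidates = [(1, link) for link in fetch_links(url)]
        # Augmented step (assumption): a relevant page triggers a search query,
        # and its results are ranked ahead of plain hyperlinks.
        if hit:
            candidates += [(0, result) for result in search(make_query(url))]
        for priority, candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                heapq.heappush(frontier, (priority, candidate))
    return relevant

# Toy data (hypothetical): "pub1" and "pub2" are not reachable by links from "seed".
LINKS = {"seed": ["a"], "pub1": ["b"]}
RELEVANT = {"seed", "pub1", "pub2"}
INDEX = {"q:seed": ["pub1"], "q:pub1": ["pub2"]}

def run(use_search):
    search = (lambda q: INDEX.get(q, [])) if use_search else (lambda q: [])
    return augmented_crawl(
        seeds=["seed"],
        fetch_links=lambda url: LINKS.get(url, []),
        is_relevant=lambda url: url in RELEVANT,
        make_query=lambda url: "q:" + url,
        search=search,
    )
```

On this toy graph, `run(False)` finds only the seed page, while `run(True)` also reaches the two pages that no hyperlink path leads to, which mirrors the abstract's claim about pages that link-following alone cannot access.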


Similar resources

Augmenting Focused Crawling using Search Engine Queries


Focused Crawling using Asynchronous Cellular Learning Automata

Web crawling is used to collect the web pages which will be indexed by a search engine. The search engine uses these crawled and indexed pages to answer users’ queries. Since the volume of web pages is very high and it increases continuously, search engines can index a limited number of web pages. Therefore, in recent years, the focused crawler algorithms have been introduced which act selectiv...


Collecte orientée sur le Web pour la recherche d'information spécialisée. (Focused document gathering on the Web for domain-specific information retrieval)

Vertical search engines, which focus on a specific segment of the Web, become more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index on a specific topic. By doing so, language ambiguities are reduced, and both the algorithm...


A New Approach Towards Vertical Search Engines - Intelligent Focused Crawling and Multilingual Semantic Techniques

Search engines typically consist of a crawler which traverses the web retrieving documents and a search frontend which provides the user interface to the acquired information. Focused crawlers refine the crawler by intelligently directing it to predefined topic areas. The evolution of search engines today is expedited by supplying more search capabilities such as a search for metadata as well a...


DHT-Based Distributed Crawler

A search engine, like Google, is built using two pieces of infrastructure: a crawler that indexes the web and a searcher that uses the index to answer user queries. While Google's crawler has worked well, there is the issue of timeliness and the lack of control given to end-users to direct the crawl according to their interests. The interface presented by such search engines is hence very limite...



Publication date: 2006